Abstract

The success of kernel-based learning methods depends on the choice of kernel.
Recently, kernel learning methods have been proposed that use data to select the
most appropriate kernel, usually by combining a set of base kernels. We introduce
a new algorithm for kernel learning that combines a continuous set of base
kernels, without the common step of discretizing the space of base kernels. We
demonstrate that our new method achieves state-of-the-art performance across a
variety of real-world datasets. Furthermore, we explicitly demonstrate the
importance of combining the right dictionary of kernels, which is problematic for
methods based on a finite set of base kernels chosen a priori. Our method is not
the first approach to work with continuously parameterized kernels. However, we
show that our method requires substantially less computation than previous such
approaches, and so is more amenable to multi-dimensional parameterizations
of base kernels, which we demonstrate.

1 Introduction

A well-known fact in machine learning is that the choice of features heavily influences the
performance of learning methods. Similarly, the performance of a learning method that uses a kernel
function is highly dependent on the choice of kernel function. The idea of kernel learning is to use
data to select the most appropriate kernel function for the learning task.

In this paper we consider kernel learning in the context of supervised learning. In particular, we
consider the problem of learning positive-coefficient linear combinations of base kernels, where
the base kernels belong to a parameterized family of kernels, $(\kappa_\sigma)_{\sigma \in \Sigma}$.
Here $\Sigma$ is a "continuous" parameter space, i.e., some subset of a Euclidean space. A prime
example (and extremely popular choice) is when $\kappa_\sigma$ is a Gaussian kernel, where $\sigma$
can be a single common bandwidth or a vector of bandwidths, one per coordinate. One approach then is
to discretize the parameter space $\Sigma$ and then find an appropriate non-negative linear
combination of the resulting set of base kernels,
$\mathcal{N} = \{\kappa_{\sigma_1}, \ldots, \kappa_{\sigma_p}\}$. The advantage of this approach is
that once the set $\mathcal{N}$ is fixed, any of the many efficient methods available in the
literature can be used to find the coefficients for combining the base kernels in $\mathcal{N}$ (see
the papers by Lanckriet et al., 2004; Sonnenburg et al., 2006; Rakotomamonjy et al., 2008; Cortes
et al., 2009a; Kloft et al., 2011 and the references therein). One potential drawback of this
approach is that it requires an appropriate, a priori choice of $\mathcal{N}$. This might be
problematic, e.g., if $\Sigma$ is contained in a Euclidean space of moderate or large dimension
(say, a dimension over 20), since the number of base kernels, $p$, grows exponentially with
dimensionality even for moderate discretization accuracies. Furthermore, independently of the
dimensionality of the parameter space, the need to choose the set $\mathcal{N}$ independently of the
data is at best inconvenient, and selecting an appropriate resolution might be far from trivial. In
this paper we explore an alternative method which avoids the need for discretizing the space
$\Sigma$.

We are not the first to realize that discretizing a continuous parameter space might be troublesome:
the method of Argyriou et al. (2005, 2006) can also work with continuously parameterized spaces
of kernels. The main issue with this method, however, is that it may get stuck in local optima since it
is based on alternating minimization and the objective function is not jointly convex. Nevertheless,
empirically, in the initial publications of Argyriou et al. (2005, 2006) this method was found to have
excellent and robust performance, showing that despite the potential difficulties, the idea of avoiding
discretizations might have some traction.

Our new method is similar to that of Argyriou et al. (2005, 2006), in that it is still based on local
search. However, our local search is used within a boosting, or more precisely, forward-stagewise
additive modeling (FSAM) procedure, a method that is known to be quite robust to how its "greedy
step" is implemented (Hastie et al., 2001, Section 10.3). Thus, we expect to suffer minimally from
issues related to local minima. A second difference to Argyriou et al. (2005, 2006) is that our method
belongs to the group of two-stage kernel learning methods. The decision to use a two-stage kernel
learning approach was motivated by the recent success of the two-stage method of Cortes et al.
(2010). In fact, our kernel learning method uses the centered kernel alignment metric of Cortes et al.
(2010) (derived from the uncentered alignment metric of Cristianini et al. (2002)) in its first stage
as the objective function of the FSAM procedure, while in the second stage a standard supervised
learning technique is used.

The technical difficulty of implementing FSAM is that one needs to compute the functional gradient
of the chosen objective function. We show that in our case this problem is equivalent to solving an
optimization problem over $\sigma \in \Sigma$ with an objective function that is a linear function of
the Gram matrix derived from the kernel $\kappa_\sigma$. Because of the nonlinear dependence of this
matrix on $\sigma$, this is the step where we need to resort to local optimization: this optimization
problem is in general non-convex. However, as we shall demonstrate empirically, even if we use local
solvers to solve this optimization step, the algorithm still shows an overall excellent performance as
compared to other state-of-the-art methods. This is not completely unexpected: one of the key ideas
underlying boosting is that it is designed to be robust even when the individual "greedy" steps are
imperfect (cf. Chapter 12, Bühlmann and van de Geer, 2011). Given the new kernel to be added to the
existing dictionary, we give a computationally efficient, closed-form expression that can be used to
determine the coefficient on the new kernel to be added to the previous kernels.

The empirical performance of our proposed method is explored in a series of experiments. Our
experiments serve multiple purposes. Firstly, we explore the potential advantages, as well as
limitations, of the proposed technique. In particular, we demonstrate that the procedure is indeed
reliable (despite the potential difficulty of implementing the greedy step) and that it can be
successfully used even when $\Sigma$ is a subset of a multi-dimensional space. Secondly, we
demonstrate that in some cases, kernel learning can yield a very large improvement over simpler
alternatives, such as combining some fixed dictionary of kernels with uniform weights. Whether this is
true is an important issue that is given weight by the fact that just recently it became a subject of
dispute (Cortes, 2009). Finally, we compare the performance of our method, both from the perspective
of its generalization capability and computational cost, to its natural, state-of-the-art
alternatives, such as the two-stage method of Cortes et al. (2010) and the algorithm of Argyriou et al.
(2005, 2006). For this, we compared our method on datasets used in previous kernel-learning work. To
give further weight to our results, we compare on more datasets than any of the previous papers that
proposed new kernel learning methods.

Our experiments demonstrate that our new method is competitive in terms of its generalization
performance, while its computational cost is significantly less than that of the competitors that
achieve similarly good generalization performance. In addition, our experiments also revealed an
interesting novel insight into the behavior of two-stage methods: we noticed that two-stage methods
can "overfit" the performance metric of the first stage. In some problems we observed that our method
could find kernels that gave rise to better (test-set) performance on the first-stage metric, while
the method's overall performance degraded when compared to using kernel combinations whose performance
on the first metric is worse. The explanation of this is that the metric of the first stage is a
surrogate performance measure and thus, just like in the case of choosing a surrogate loss in
classification, better performance according to this surrogate metric does not necessarily transfer
into better performance in the primary metric, as there is no monotonicity relation between these two
metrics. We also show that with proper capacity control, the problem of overfitting the surrogate
metric can be overcome. Finally, our experiments show a clear advantage to using kernel learning
methods as opposed to combining kernels with a uniform weight, although it seems that the advantage
mainly comes from the ability of our method to discover the right set of kernels. This conclusion is
strengthened by the fact that the closest competitor to our method was found to be the method of
Argyriou et al. (2006), which also searches the continuous parameter space, avoiding discretizations.
Our conclusion is that the choice of the base dictionary seems to be more important than how the
dictionary elements are combined, and that the a priori choice of this dictionary may not be trivial.
This is certainly true already when the number of parameters is moderate. Moreover, when the number of
parameters is larger, simple discretization methods are infeasible, whereas our method can still
produce meaningful dictionaries.



2 The New Method 



The purpose of this section is to describe our new method. Let us start with the introduction of the
problem setting and the notation. We consider binary classification problems, where the data
$\mathcal{D} = ((X_1, Y_1), \ldots, (X_n, Y_n))$ is a sequence of independent, identically distributed
random variables, with $(X_i, Y_i) \in \mathbb{R}^d \times \{-1, +1\}$. For convenience, we introduce
two other pairs of random variables $(X, Y)$, $(X', Y')$, which are also independent of each other and
share the same distribution with $(X_i, Y_i)$. The goal of classifier learning is to find a predictor,
$g : \mathbb{R}^d \to \{-1, +1\}$, such that the predictor's risk, $L(g) = P(g(X) \ne Y)$, is close to
the Bayes risk, $\inf_g L(g)$. We will consider a two-stage method, as noted in the introduction. The
first stage of our method will pick some kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
from some set of kernels $\mathcal{K}$ based on $\mathcal{D}$, which is then used in the second stage,
using the same data $\mathcal{D}$, to find a good predictor.¹

Consider a parametric family of base kernels, $(\kappa_\sigma)_{\sigma \in \Sigma}$. The kernels
considered by our method belong to the set

$$\mathcal{K} = \Big\{ \sum_{i=1}^{r} \mu_i \kappa_{\sigma_i} \;:\; r \in \mathbb{N},\ \mu_i \ge 0,\ \sigma_i \in \Sigma,\ i = 1, \ldots, r \Big\},$$

i.e., we allow non-negative linear combinations of a finite number of base kernels. For example,
the base kernel could be a Gaussian kernel, where $\sigma > 0$ is its bandwidth:
$\kappa_\sigma(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$, where $x, x' \in \mathbb{R}^d$. However, one
could also have a separate bandwidth for each coordinate.
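As a rough illustration of this family, the following NumPy sketch builds the Gram matrix of a
Gaussian base kernel with either a common bandwidth or one bandwidth per coordinate, and forms a
non-negative combination of such matrices. The function names `gaussian_kernel_matrix` and `combine`
are hypothetical and chosen only for this example; the per-coordinate form
$\exp(-\sum_i (x_i - x'_i)^2 / \sigma_i^2)$ is assumed as the natural extension of the common-bandwidth
kernel.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gram matrix of kappa_sigma(x, x') = exp(-sum_i (x_i - x'_i)^2 / sigma_i^2).

    `sigma` may be a scalar (common bandwidth) or a length-d vector
    (one bandwidth per coordinate).
    """
    X = np.asarray(X, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (X.shape[1],))
    Z = X / sigma                                  # rescale each coordinate
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0.0))            # clip tiny negative round-off

def combine(X, mus, sigmas):
    """Gram matrix of the non-negative combination sum_i mu_i * kappa_{sigma_i}."""
    if any(mu < 0 for mu in mus):
        raise ValueError("coefficients must be non-negative")
    return sum(mu * gaussian_kernel_matrix(X, s) for mu, s in zip(mus, sigmas))
```

Because each Gaussian Gram matrix is positive semidefinite and the coefficients are non-negative, the
combined matrix is positive semidefinite as well.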

The "ideal" kernel underlying the common distribution of the data is
$k^*(x, x') = E[YY' \mid X = x, X' = x']$. Our new method attempts to find a kernel
$k \in \mathcal{K}$ which is maximally aligned to this ideal kernel, where, following Cortes et al.
(2010), the alignment between two kernels $k, \tilde{k}$ is measured by the centered alignment metric,²

$$A_c(k, \tilde{k}) = \frac{\langle k_c, \tilde{k}_c \rangle}{\|k_c\| \, \|\tilde{k}_c\|},$$

where $k_c$ is the kernel underlying $k$ centered in the feature space (similarly for $\tilde{k}_c$),
$\langle k, \tilde{k} \rangle = E[k(X, X') \tilde{k}(X, X')]$ and $\|k\|^2 = \langle k, k \rangle$. A
kernel $k$ centered in the feature space, by definition, is the unique kernel $k_c$ such that for any
$x, x'$, $k_c(x, x') = \langle \Phi(x) - E[\Phi(X)], \Phi(x') - E[\Phi(X)] \rangle$, where $\Phi$ is a
feature map underlying $k$. By considering centered kernels $k_c, \tilde{k}_c$ in the alignment
metric, one implicitly matches the mean responses $E[k(X, X')]$, $E[\tilde{k}(X, X')]$ before
considering the alignment between the kernels (thus, centering depends on the distribution of $X$). An
alternative way of stating this is that centering cancels mismatches of the mean responses between the
two kernels. When one of the kernels is the ideal kernel, centered alignment effectively standardizes
the alignment by cancelling the effect of imbalanced class distributions. For further discussion of
the virtues of centered alignment, see the paper by Cortes et al. (2010).



¹ One could consider splitting the data, but we see no advantage to doing so. Also, the methods for the
second stage are not a focus of this work and the particular methods used in the experiments are described later.
² Note that the word metric is used in its everyday sense and not in its mathematical sense.

Algorithm 1 Forward stagewise additive modeling for kernel learning with a continuously
parameterized set of kernels. For the definitions of $f$, $F$, $F'$ and
$K : \mathcal{K} \to \mathbb{R}^{n \times n}$, see the text.

1: Inputs: data $\mathcal{D}$, kernel initialization parameter $\varepsilon$, the number of iterations $T$,
   tolerance $\theta$, maximum stepsize $\eta_{\max} > 0$.
2: $K^0 \leftarrow \varepsilon I_n$.
3: for $t = 1$ to $T$ do
4:     $P \leftarrow F'(K^{t-1})$
5:     $P \leftarrow C_n P C_n$
6:     $\sigma^* = \arg\max_{\sigma \in \Sigma} \langle P, K(\kappa_\sigma) \rangle_F$
7:     $K' = C_n K(\kappa_{\sigma^*}) C_n$
8:     $\eta^* = \arg\max_{0 \le \eta \le \eta_{\max}} F(K^{t-1} + \eta K')$
9:     $K^t \leftarrow K^{t-1} + \eta^* K'$
10:    if $F(K^t) < F(K^{t-1}) + \theta$ then terminate
11: end for



Since the common distribution underlying the data is unknown, one resorts to empirical approximations
to alignment and centering, resulting in the empirical alignment metric,

$$A_c(K, \tilde{K}) = \frac{\langle K_c, \tilde{K}_c \rangle_F}{\|K_c\|_F \, \|\tilde{K}_c\|_F},$$

where $K = (k(X_i, X_j))_{1 \le i,j \le n}$ and $\tilde{K} = (\tilde{k}(X_i, X_j))_{1 \le i,j \le n}$
are the kernel matrices underlying $k$ and $\tilde{k}$, and for a kernel matrix $K$,
$K_c = C_n K C_n$, where $C_n$ is the so-called centering matrix defined by
$C_n = I_{n \times n} - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, $I_{n \times n}$ being the
$n \times n$ identity matrix and $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^n$. The empirical
counterpart of maximizing $A_c(k, k^*)$ is to maximize $A_c(K, K^*)$, where $K^* = YY^\top$, and
$Y = (Y_1, \ldots, Y_n)^\top$ collects the responses into an $n$-dimensional vector. Here, $K$ is the
kernel matrix derived from a kernel $k \in \mathcal{K}$. To make this connection clear, we will write
$K = K(k)$. Define $f : \mathcal{K} \to \mathbb{R}$ by $f(k) = A_c(K(k), K^*)$.
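The empirical centered alignment can be sketched in a few lines of NumPy. This is a minimal
illustration rather than the authors' implementation; the helper names `center`,
`centered_alignment` and `target_kernel` are hypothetical.

```python
import numpy as np

def center(K):
    """Return C_n K C_n with the centering matrix C_n = I - (1/n) 1 1^T."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def centered_alignment(K1, K2):
    """Empirical centered alignment <K1_c, K2_c>_F / (||K1_c||_F ||K2_c||_F)."""
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def target_kernel(y):
    """Target matrix K* = Y Y^T for a label vector with entries in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    return np.outer(y, y)
```

Two properties are easy to check with this sketch: the value is unchanged when a kernel matrix is
multiplied by a positive constant, and it is unchanged when a constant is added to every entry, since
centering removes the constant part.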

To find an approximate maximizer of $f$, we propose a steepest ascent approach to forward stagewise
additive modeling (FSAM). FSAM (Hastie et al., 2001) is an iterative method for optimizing an
objective function by sequentially adding new basis functions without changing the parameters and
coefficients of the previously added basis functions. In the steepest ascent approach, in iteration
$t$, we search for the base kernel in $(\kappa_\sigma)$ defining the direction in which the growth
rate of $f$ is the largest, locally in a small neighborhood of the previous candidate $k^{t-1}$:

$$\sigma_t^* = \arg\max_{\sigma \in \Sigma} \lim_{\varepsilon \to 0} \frac{f(k^{t-1} + \varepsilon \kappa_\sigma) - f(k^{t-1})}{\varepsilon}. \qquad (1)$$

Once $\sigma_t^*$ is found, the algorithm finds the coefficient $0 \le \eta_t \le \eta_{\max}$³ such
that $f(k^{t-1} + \eta_t \kappa_{\sigma_t^*})$ is maximized, and the candidate is updated using
$k^t = k^{t-1} + \eta_t \kappa_{\sigma_t^*}$. The process stops when the objective function $f$ ceases
to increase by an amount larger than $\theta > 0$, or when the number of iterations becomes larger
than a predetermined limit $T$, whichever happens earlier.

Proposition 1. The value of $\sigma_t^*$ can be obtained by

$$\sigma_t^* = \arg\max_{\sigma \in \Sigma} \big\langle K(\kappa_\sigma), F'\big((K(k^{t-1}))_c\big) \big\rangle_F, \qquad (2)$$

where for a kernel matrix $K$,

$$F'(K) = \frac{K^*_c - \|K\|_F^{-2} \langle K, K^*_c \rangle_F \, K}{\|K\|_F \, \|K^*_c\|_F}. \qquad (3)$$

The proof can be found in the supplementary material. The crux of the proposition is that the
directional derivative in (1) can be calculated and gives the expression maximized in (2).
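The gradient expression can be checked numerically. The sketch below assumes that $F$ denotes the
alignment of an already centered matrix with the centered target,
$F(K) = \langle K, K^*_c \rangle_F / (\|K\|_F \|K^*_c\|_F)$, which is the function whose gradient has
the stated form; this reading of $F$ and the helper names are assumptions made for the example. The
inner product of `F_prime(K)` with a direction `D` is compared against a central finite difference.

```python
import numpy as np

def F(K, Ks_c):
    """Alignment of a (centered) matrix K with the centered target Ks_c."""
    return np.sum(K * Ks_c) / (np.linalg.norm(K) * np.linalg.norm(Ks_c))

def F_prime(K, Ks_c):
    """Gradient of F with respect to K (Frobenius inner-product convention)."""
    nK = np.linalg.norm(K)
    return (Ks_c - np.sum(K * Ks_c) / nK**2 * K) / (nK * np.linalg.norm(Ks_c))

def directional_derivative(K, D, Ks_c, h=1e-6):
    """Central finite-difference estimate of d/dh F(K + h D) at h = 0."""
    return (F(K + h * D, Ks_c) - F(K - h * D, Ks_c)) / (2.0 * h)
```

Since $F$ is invariant to rescaling $K$, the gradient is orthogonal to $K$ itself, which gives a
second quick sanity check.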



³ In all our experiments we use the arbitrary value $\eta_{\max} = 1$. Note that the value of
$\eta_{\max}$, together with the limit $T$, acts as a regularizer. However, in our experiments, the
procedure always stops before the limit $T$ on the number of iterations is reached.

Table 1: List of the kernel learning methods evaluated in the experiments. The key to the naming
of the methods is as follows: CA stands for "continuous alignment" maximization, CR stands for
"continuous risk" minimization, DA stands for "discrete alignment", D1, D2, DU should be obvious.

Abbr.   Method
CA      Our new method
CR      From Argyriou et al. (2005)
DA      From Cortes et al. (2010)
D1      $\ell_1$-norm MKL (Kloft et al., 2011)
D2      $\ell_2$-norm MKL (Kloft et al., 2011)
DU      Uniform weights over kernels

In general, the optimization problem (2) is not convex and the cost of obtaining a (good approximate)
solution is hard to predict. Evidence that, at least in some cases, the function to be optimized is
not ill-behaved is presented in Section B.1 of the supplementary material. In our experiments, an
approximate solution to (2) is found using numerical methods.⁴ As a final remark on this issue, note
that, as is usual in boosting, finding the global optimizer in (2) might not be necessary for achieving
good statistical performance.

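The other parameter, $\eta_t$, however, is easy to find, since the underlying optimization problem has a
closed form solution:

Proposition 2. The value of $\eta_t$ is given by
$\eta_t = \arg\max_{\eta \in \{\eta^*, \eta_{\max}\}} f(k^{t-1} + \eta \kappa_{\sigma_t^*})$, where
$\eta^* = \max(0, (ad - bc)/(bd - ae))$ if $bd - ae \ne 0$ and $\eta^* = 0$ otherwise,
$a = \langle K, K^* \rangle_F$, $b = \langle K', K^* \rangle_F$, $c = \langle K, K \rangle_F$,
$d = \langle K, K' \rangle_F$, $e = \langle K', K' \rangle_F$ and $K = (K(k^{t-1}))_c$,
$K' = (K(\kappa_{\sigma_t^*}))_c$.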
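To see where the closed form comes from, note that for centered $K$ and $K'$ the objective along the
search direction is proportional to $g(\eta) = (a + \eta b)/\sqrt{c + 2\eta d + \eta^2 e}$, and the
sign of $g'(\eta)$ equals the sign of $(bc - ad) + \eta (bd - ae)$, so the only stationary point is
$(ad - bc)/(bd - ae)$. The sketch below is an illustrative implementation under two conservative
choices that are ours rather than the proposition's: the stationary point is clipped to
$[0, \eta_{\max}]$, and $\eta = 0$ is always included among the candidates. The name `step_size` is
hypothetical.

```python
import numpy as np

def step_size(K, Kp, Ks, eta_max=1.0):
    """Maximize the alignment of K + eta * Kp with Ks over eta in [0, eta_max].

    K and Kp are assumed to be centered kernel matrices; Ks is the target Y Y^T.
    """
    a, b = np.sum(K * Ks), np.sum(Kp * Ks)
    c, d, e = np.sum(K * K), np.sum(K * Kp), np.sum(Kp * Kp)
    denom = b * d - a * e
    eta_star = max(0.0, (a * d - b * c) / denom) if denom != 0 else 0.0

    def g(eta):
        # Alignment along the search direction, up to the constant 1 / ||Ks||_F.
        return (a + eta * b) / np.sqrt(c + 2.0 * eta * d + eta**2 * e)

    # Conservative candidate set: both end points and the clipped stationary point.
    candidates = [0.0, min(eta_star, eta_max), eta_max]
    return max(candidates, key=g)
```

Because $g'$ changes sign at most once, the maximum over the interval is always attained at one of
these three candidates, so no numerical line search is needed.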

The pseudocode of the full algorithm is presented in Algorithm 1. The algorithm needs the data,
the number of iterations ($T$) and a tolerance ($\theta$) parameter, in addition to a parameter
$\varepsilon$ used in the initialization phase and $\eta_{\max}$. The parameter $\varepsilon$ is used
in the initialization step to avoid division by zero, and its value has little effect on the
performance. Note that the cost of computing a kernel matrix, or the inner product of two such
matrices, is $O(n^2)$. Therefore, the complexity of the algorithm (with a naive implementation) is at
least quadratic in the number of samples. The actual cost will be strongly influenced by how many of
these kernel-matrix evaluations (or inner product computations) are needed in (2). Lacking a better
understanding of this, we include actual running times in the experiments, which give a rough
indication of the computational limits of the procedure.
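The following NumPy sketch puts the pieces of the procedure together for a Gaussian family with a
single common bandwidth. It is an illustration under stated assumptions, not the authors' code: the
running matrix is kept centered throughout, the inner search over the bandwidth uses a coarse
logarithmic grid followed by a golden-section refinement as a simple stand-in for a general-purpose
local solver, the search range `[lo, hi]` is arbitrary, and all function names and default parameter
values are hypothetical.

```python
import numpy as np

def center(K):
    """Apply the centering matrix C_n = I - (1/n) 1 1^T on both sides."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

def gram(X, sigma):
    """Gaussian Gram matrix exp(-||x - x'||^2 / sigma^2), common bandwidth."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / sigma**2)

def F(K, Ks):
    """Alignment <K, Ks>_F / (||K||_F ||Ks||_F) of two centered matrices."""
    return np.sum(K * Ks) / (np.linalg.norm(K) * np.linalg.norm(Ks))

def F_prime(K, Ks):
    """Gradient of F with respect to K."""
    nK = np.linalg.norm(K)
    return (Ks - np.sum(K * Ks) / nK**2 * K) / (nK * np.linalg.norm(Ks))

def best_sigma(X, P, lo=1e-2, hi=1e2, coarse=25, iters=30):
    """Approximate argmax over sigma of <P, K(kappa_sigma)>_F (stand-in solver)."""
    score = lambda s: np.sum(P * gram(X, np.exp(s)))    # s = log(sigma)
    grid = np.linspace(np.log(lo), np.log(hi), coarse)
    i = int(np.argmax([score(s) for s in grid]))
    a, b = grid[max(i - 1, 0)], grid[min(i + 1, coarse - 1)]
    r = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):                               # golden-section refinement
        c, d = b - r * (b - a), a + r * (b - a)
        if score(c) > score(d):
            b = d
        else:
            a = c
    s_ref = 0.5 * (a + b)
    return float(np.exp(s_ref if score(s_ref) >= score(grid[i]) else grid[i]))

def step_size(K, Kp, Ks, eta_max):
    """Closed-form line search over eta in [0, eta_max]."""
    a, b = np.sum(K * Ks), np.sum(Kp * Ks)
    c, d, e = np.sum(K * K), np.sum(K * Kp), np.sum(Kp * Kp)
    denom = b * d - a * e
    eta_star = max(0.0, (a * d - b * c) / denom) if denom != 0 else 0.0
    g = lambda eta: (a + eta * b) / np.sqrt(c + 2.0 * eta * d + eta**2 * e)
    return max([0.0, min(eta_star, eta_max), eta_max], key=g)

def fsam(X, y, T=50, eps=1e-10, theta=1e-3, eta_max=1.0):
    """Return the selected bandwidths, their weights, and the final centered matrix."""
    Ks = center(np.outer(y, y))
    K = center(eps * np.eye(len(y)))                     # centered version of K^0
    sigmas, etas = [], []
    for _ in range(T):
        P = center(F_prime(K, Ks))
        sigma = best_sigma(X, P)
        Kp = center(gram(X, sigma))
        eta = step_size(K, Kp, Ks, eta_max)
        K_prev, K = K, K + eta * Kp
        sigmas.append(sigma)
        etas.append(eta)
        if F(K, Ks) < F(K_prev, Ks) + theta:             # improvement below tolerance
            break
    return sigmas, etas, K
```

Each iteration evaluates the Gram matrix once per candidate bandwidth, so the per-iteration cost is the
number of inner-search evaluations times the $O(n^2)$ cost of one kernel-matrix inner product.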

3 Experimental Evaluation 

In this section we compare our kernel learning method with several kernel learning methods on
synthetic and real data; see Table 1 for the list of methods. Our method is labeled CA for
Continuous Alignment-based kernel learning. In all of the experiments, we use the following values
with CA: $T = 50$, $\varepsilon = 10^{-10}$, and $\theta = 10^{-3}$. The first two methods, i.e., our
algorithm and CR (Argyriou et al., 2005), are able to pick kernel parameters from a continuous set,
while the rest of the algorithms work with a finite number of base kernels.

In Section 3.1 we use synthetic data to illustrate the potential advantage of methods that work with
a continuously parameterized set of kernels and the importance of combining multiple kernels. We
also illustrate in a toy example that multi-dimensional kernel parameter search can improve
performance. These are followed by the evaluation of the above listed methods on several real datasets
in Section 3.2.

3.1 Synthetic Data 

The purpose of these experiments is mainly to provide empirical proof for the following hypotheses:
(H1) The combination of multiple kernels can lead to improved performance as compared to what
can be achieved with a single kernel, even when in theory a single kernel from the family suffices
to get a consistent classifier. (H2) The methods that search the continuously parameterized families
are able to find the "key" kernels and their combination. (H3) Our method can even search
multi-dimensional parameter spaces, which in some cases is crucial for good performance.

⁴ In particular, we use the fmincon function of Matlab, with the interior-point algorithm option.

To illustrate (H1) and (H2) we have designed the following problem: the inputs are generated from
the uniform distribution over the interval $[-10, 10]$. The label of each data point is determined by
the function $y(x) = \mathrm{sign}(f(x))$, where
$f(x) = \sin(\sqrt{2}\,x) + \sin(\sqrt{12}\,x) + \sin(\sqrt{60}\,x)$. Training and validation sets
include 500 data points each, while the test set includes 1000 instances. Figure 1(a) shows the
functions $f$ (blue curve) and $y$ (red dots). For this experiment we use Dirichlet kernels of
degree one,⁵ parameterized with a frequency parameter $\sigma$:
$\kappa_\sigma(x, x') = 1 + 2\cos(\sigma \|x - x'\|)$.
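The synthetic problem is simple enough to reproduce in a few lines. The sketch below generates inputs
and labels as described and evaluates the degree-one Dirichlet kernel; the function names and the use
of NumPy's random generator are choices made for this illustration.

```python
import numpy as np

FREQS = np.sqrt([2.0, 12.0, 60.0])        # the three frequencies that define f

def f(x):
    """f(x) = sin(sqrt(2) x) + sin(sqrt(12) x) + sin(sqrt(60) x)."""
    return sum(np.sin(w * x) for w in FREQS)

def sample(n, rng):
    """Draw n inputs uniformly from [-10, 10] and label them with sign(f(x))."""
    x = rng.uniform(-10.0, 10.0, size=n)
    return x, np.sign(f(x))

def dirichlet_kernel(x, xp, sigma):
    """Degree-one Dirichlet kernel 1 + 2 cos(sigma |x - x'|) for 1-D inputs."""
    return 1.0 + 2.0 * np.cos(sigma * np.abs(x[:, None] - xp[None, :]))
```

Each such kernel has a three-dimensional feature space (spanned by $1$, $\cos(\sigma x)$ and
$\sin(\sigma x)$), which is one way to see why a single frequency cannot represent $f$ while the
combination of the three matching frequencies can.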

In order to investigate (H1), we trained classifiers with a single frequency kernel from the set
$\sqrt{2}$, $\sqrt{12}$, and $\sqrt{60}$ (which we thought were good guesses of the single best
frequencies). The trained classifiers achieved misclassification error rates of 26.1%, 26.8%, and
28.6%, respectively. Classifiers trained with a pair of frequencies, i.e.,
$\{\sqrt{2}, \sqrt{12}\}$, $\{\sqrt{2}, \sqrt{60}\}$, and $\{\sqrt{12}, \sqrt{60}\}$, achieved
error rates of 16.4%, 20.0%, and 21.3%, respectively (the kernels were combined using uniform
weights). Finally, a classifier that was trained with all three frequencies achieved an error rate of
2.3%.

Let us now turn to (H2). As shown in Figure 1(b), the CA and CR methods both achieved a
misclassification error close to what was seen when the three best frequencies were used, showing that
they are indeed effective.⁶ Furthermore, Figure 1(c) shows that the discovered frequencies are close
to the frequencies used to generate the data. For the sake of illustration, we also tested the
methods which require the discretization of the parameter space. We choose ten Dirichlet kernels with
$\sigma \in \{0, 1, \ldots, 9\}$, covering the range of frequencies defining $f$. As can be seen from
Figure 1(b), in this example the chosen discretization accuracy is insufficient. Although it would be
easy to increase the discretization accuracy to improve the results of these methods,⁷ the point is
that if a high resolution is needed in a single-dimensional problem, then these methods are likely to
face serious difficulties in problems where the space of kernels is more complex (e.g., the
parameterization is multidimensional). Nevertheless, we are not suggesting that the methods which
require discretization are universally inferior, but merely wish to point out that an "appropriate
discrete kernel set" might not always be available.

To illustrate (H3) we designed a second set of problems: the instances for the positive (negative)
class are generated from a $d = 50$-dimensional Gaussian distribution with covariance matrix
$C = I_{d \times d}$ and mean $\mu_1 = \rho\,\theta / \|\theta\|$ (respectively, $\mu_2 = -\mu_1$ for
the negative class). Here $\rho = 1.75$. The vector $\theta \in [0, 1]^d$ determines the relevance of
each feature in the classification task, e.g., $\theta_i = 0$ implies that the distributions of the
two classes have zero means in the $i$th feature, which renders this feature irrelevant. The value of
each component of the vector $\theta$ is calculated as $\theta_i = (i/d)^\gamma$, where $\gamma$ is a
constant that determines the relative importance of the elements of $\theta$. We generate seven
datasets with $\gamma \in \{0, 1, 2, 5, 10, 20, 40\}$. For each value of $\gamma$, the training set
consists of 50 data points (the prior distribution for the two classes is uniform). The test error
values are measured on a test set with 1000 instances. We repeated each experiment 10 times and report
the average misclassification error and alignment measured over the test set along with the running
time.
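A data generator for this family of problems might look as follows. The sketch assumes that the
positive-class mean is $\rho\,\theta/\|\theta\|$ with $\theta_i = (i/d)^\gamma$ and that the negative
class uses the negated mean; the function names are hypothetical.

```python
import numpy as np

def relevance(d, gamma):
    """Relevance vector theta with theta_i = (i / d) ** gamma, i = 1, ..., d."""
    return (np.arange(1, d + 1) / d) ** gamma

def class_mean(d, gamma, rho=1.75):
    """Mean of the positive class: rho * theta / ||theta|| (assumed form)."""
    theta = relevance(d, gamma)
    return rho * theta / np.linalg.norm(theta)

def sample(n, gamma, rng, d=50, rho=1.75):
    """Draw n labelled points; labels are uniform on {-1, +1}, covariance is I."""
    mu = class_mean(d, gamma, rho)
    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    return X, y
```

With $\gamma = 0$ all coordinates are equally relevant, while for large $\gamma$ the mean difference
concentrates on the last few coordinates and the remaining ones carry almost no signal.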

⁵ We repeated the experiments using Gaussian kernels with nearly identical results.
⁶ In all of the experiments in this paper, the classifiers for the two-stage methods were trained using the
soft margin SVM method, where the regularization coefficient of SVM was chosen by cross-validation from
$10^{\{-5, -4.5, \ldots, 4.5, 5\}}$.
⁷ Further experimentation found that a discretization below 0.1 is necessary in this example.

Figure 1: (a): The function $f(x) = \sin(\sqrt{2}\,x) + \sin(\sqrt{12}\,x) + \sin(\sqrt{60}\,x)$ used
for generating synthetic data, along with $\mathrm{sign}(f)$. (b): Misclassification percentages
obtained by each algorithm. (c): The kernel frequencies found by the CA method.

We test two versions of our method: one that uses a family of Gaussian kernels with a common
bandwidth (denoted by CA-1D), and another one (denoted by CA-nD) that searches in the space
$(\kappa_\sigma)_{\sigma \in (0, \infty)^{50}}$, where each coordinate has a separate bandwidth
parameter, $\kappa_\sigma(x, x') = \exp(-\sum_{i=1}^{d} (x_i - x'_i)^2 / \sigma_i^2)$. Since the
training set is small, one can easily overfit while optimizing the alignment. Hence, we modify the
algorithm to shrink the values of the bandwidth parameters to their common average value by
modifying (2):

$$\sigma_t^* = \arg\min_{\sigma \in \Sigma} \; -\big\langle K(\kappa_\sigma), F'\big((K(k^{t-1}))_c\big) \big\rangle_F + \lambda \|\sigma - \bar{\sigma}\|_2^2, \qquad (4)$$

where $\bar{\sigma} = \frac{1}{d} \sum_{i=1}^{d} \sigma_i$ and $\lambda$ is a regularization
parameter. We also include results obtained for finite kernel learning methods. For these methods, we
generate 50 Gaussian kernels with bandwidths $\sigma \in m g^{\{0, \ldots, 49\}}$, where
$m = 10^{-3}$ and $g \approx 1.33$. Therefore, the bandwidth range constitutes a geometric sequence
from $10^{-3}$ to $10^{3}$. Further details of the experimental setup can be found in Section B.2 of
the supplementary material.
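The regularized search objective can be written down directly. In the sketch below, `P` stands for
the (centered) gradient matrix that appears in the unregularized search, `sigma` is the vector of
per-coordinate bandwidths, and the penalty pulls every bandwidth towards the average of all
bandwidths; the function names are hypothetical, and the minimization itself is left to whatever
numerical solver is available.

```python
import numpy as np

def gram_nd(X, sigma):
    """Gaussian Gram matrix with one bandwidth per coordinate."""
    Z = X / sigma
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2)

def regularized_objective(sigma, X, P, lam):
    """-<K(kappa_sigma), P>_F + lam * ||sigma - mean(sigma)||_2^2 (to be minimized)."""
    sigma = np.asarray(sigma, dtype=float)
    fit = -np.sum(P * gram_nd(X, sigma))
    penalty = lam * np.sum((sigma - sigma.mean())**2)
    return fit + penalty
```

When all bandwidths are equal the penalty vanishes, so for very large values of the regularization
parameter the search effectively reduces to the common-bandwidth case.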

Figure 2 shows the results. Recall that the larger the value of $\gamma$, the larger the number of
nearly irrelevant features. Since methods which search only a one-dimensional space cannot
differentiate between relevant and irrelevant features, their misclassification rate increases with
$\gamma$. Only CA-nD is able to cope with this situation and even improve its performance. We observed
that without regularization, though, CA-nD drastically overfits (for small values of $\gamma$). We
also show the running times of the methods to give the reader an idea about the scalability of the
methods. The running time of CA-nD is larger than that of CA-1D both because of the use of
cross-validation to tune $\lambda$ and because of the increased cost of the multidimensional search.
Although the large running time might be a problem, for some problems, CA-nD might be the only method
to deliver good performance amongst the methods studied.⁸



3.2 Real Data 

We evaluate the methods listed in Table 1 on several binary classification tasks from
MNIST and the UCI Letter recognition dataset, along with several other datasets from
the UCI machine learning repository (Frank and Asuncion, 2010) and Delve datasets (see
http://www.cs.toronto.edu/~delve/data/datasets.html).



⁸ We have not attempted to run a multi-dimensional version of the CR method, since already the
one-dimensional version of this method is at least one order of magnitude slower than our CA-1D method.

Figure 2: Performance and running time of various methods for a 50-dimensional synthetic problem
as a function of the relevance parameter $\gamma$. Note that the number of irrelevant features
increases with $\gamma$. For details of the experiments, see the text.

Table 2: Median rank and running time (sec.) of kernel learning methods obtained in experiments.

                     CA-1D    CA-nD        CR         DA       D1       D2       DU
Rank  MNIST          1        N/A          2          4.5      4.5      5        4
Rank  Letter         1        4.5          2          3.5      7        6        5
Rank  11 datasets    3        2            3          3        4        6        6
Time  MNIST          12 ± 1   N/A          377 ± 56   31 ± 1   57 ± 6   58 ± 3   10 ± 1
Time  Letter         9 ± 1    1986 ± 247   590 ± 21   11 ± 1   21 ± 1   22 ± 1   5 ± 1


MNIST. In the first experiment, following Argyriou et al. (2005), we choose 8 handwritten digit
recognition tasks of various difficulty from the MNIST dataset (LeCun and Cortes, 2010). This
dataset consists of 28 x 28 images with pixel values ranging between 0 and 255. In these experiments,
we used Gaussian kernels with parameter $\sigma$: $G_\sigma(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$.
Due to the large number of attributes (784) in the MNIST dataset, we only evaluate the 1-dimensional
version of our method. For the algorithms that work with a finite kernel set, we pick 20 kernels with
the value of $\sigma$ picked from an equidistant discretization of the interval [500, 50000]. In each
experiment, the training and validation sets consist of 500 and 1000 data points, while the test set
has 2000 data points. We repeated each experiment 10 times. Due to the lack of space, the test-set
error plots for all of the problems can be found in the supplementary material (see Section B.3). In
order to give an overall impression of the algorithms' performance, we ranked them based on the
results obtained in the above experiment. Table 2 reports the median ranks of the methods for the
experiment just described.

Overall, methods that choose $\sigma$ from a continuous set outperformed their finite counterparts.
This suggests again that for the finite kernel learning methods the range of $\sigma$ and the
discretization of this range are important to the accuracy of the resulting classifier.



UCI Letter Recognition. In another experiment, we evaluated these methods on 12 binary
classification tasks from the UCI Letter recognition dataset. This dataset includes 20000 data points
of the 26 capital letters in the English alphabet. For each binary classification task, the training
and validation sets include 300 and 200 data points, respectively. The misclassification errors are
measured over 1000 test points. As with MNIST, we used Gaussian kernels. However, in this experiment,
we ran our method with both 1-dimensional and n-dimensional search procedures. The rest of the
methods learn a single parameter, and the finite kernel learning methods were provided with 20 kernels
with $\sigma$'s chosen from the interval [1, 200] in an equidistant manner. The plots of
misclassification error and alignment are available in the supplementary material (see Section B.3).
We report the median rank of each method in Table 2. While the 1-dimensional version of our method
outperforms the rest of the methods, the classifier built on the kernel found by the multi-dimensional
version of our method did not perform well. We examined the value of alignment between the learned
kernel and the target label kernel on the test set achieved by each method. The results are available
in the supplementary material (see Section B.3). The multidimensional version of our method achieved
the highest value of alignment in every task in this experiment. Hence, a higher value of alignment
between the learned kernel and the ideal kernel does not necessarily translate into a higher accuracy
of the classifier. Aside from this observation, the same trends observed in the MNIST data can be
seen here. The continuous kernel learning methods (CA-1D and CR) outperform the finite kernel
learning methods.

Miscellaneous datasets. In the last experiment we evaluate all methods on 11 datasets chosen
from the UCI machine learning repository and Delve datasets. Most of these datasets were used
previously to evaluate kernel learning algorithms (Lanckriet et al., 2004; Cortes et al., 2009a,b,
2010; Rakotomamonjy et al., 2008). The specification of each dataset and the performance of each
method are available in the supplementary material (see Section B.3). The median rank of each method
is shown in Table 2. Contrary to the Letter experiment, in this case the multi-dimensional version of
our method outperforms the rest of the methods.

Running Times. We measured the time required for each run and each kernel learning method in
the MNIST and the UCI Letter experiments. In each case we took the average of the running time
of each method over all tasks. The average times, along with their standard errors, are reported
in the corresponding table. Among all methods, DU is the fastest, which is expected, as it requires
no additional time to compute kernel weights. CA-1D is the fastest among the remaining methods.
In these experiments our method converges in fewer than 10 iterations (kernels). The general trend
is that one-stage kernel learning methods, i.e., D1, D2, and CR, are slower than the two-stage
methods, CA and DA. Among all methods, the other continuous kernel learning method, CR, is the
slowest, since (1) it is a one-stage algorithm and (2) it usually requires more iterations (around
50) to converge. We also examined the DC-programming version of the CR method (Argyriou et al.,
2006). While it is roughly three times faster than the original gradient-based approach, it is still
significantly slower than the rest of the methods in our experiments.
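
For concreteness, the averaging described above can be sketched as follows. The timings are hypothetical, and we assume the standard error is the sample standard deviation divided by the square root of the number of runs; the source does not spell out the estimator.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(times):
    """Return (mean, standard error of the mean) for a list of running times."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two measurements")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(times), stdev(times) / sqrt(n)


# Hypothetical running times in seconds for one method over three tasks.
times_seconds = [10.0, 12.0, 14.0]
avg, se = mean_and_standard_error(times_seconds)
print(f"{avg:.2f} +/- {se:.2f} s")
```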

4 Conclusion and Future Work 

We presented a novel method for kernel learning. This method addresses the problem of learning 
a kernel in the positive linear span of some continuously parameterized kernel family. The algo- 
rithm implements a steepest ascent approach to forward stagewise additive modeling to maximize 
an empirical centered correlation measure between the kernel and the empirical approximation to the 
ideal response-kernel. The method was shown to perform well in a series of experiments, both with
synthetic and real data. We showed that in single-dimensional kernel parameter search, our method
outperforms standard multiple kernel learning methods without the need to discretize the parameter
space. While the method of Argyriou et al. (2005) also benefits from searching in a continuous
space, it was seen to require significantly more computation time than our method. We
also showed that our method can successfully deal with high-dimensional kernel parameter spaces, 
which, at least in our experiments, the method of Argyriou et al. (2005, 2006) had problems with. 

The main lesson of our experiments is that the methods that start by discretizing the kernel space 
without using the data might lose the potential to achieve good performance before any learning 
happens. 

We think that currently our method is the most efficient method to design data-dependent dictio- 
naries that provide competitive performance. It remains an interesting problem to be explored in 
the future whether there exist methods that are provably efficient and yet their performance remains 
competitive. Although in this work we directly compared our method to finite-kernel methods, it is 
also natural to combine dictionary search methods (like ours) with finite-kernel learning methods. 
However, the thorough investigation of this option remains for future work. 

A secondary outcome of our experiments is the observation that although test-set alignment is
generally a good indicator of good predictive performance, a larger test-set alignment does not
necessarily translate into a smaller misclassification error. Although this is not completely
unexpected, we think that it will be important to thoroughly explore the implications of this
observation.

A Proofs

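A.1 Proof of Proposition 1

First, notice that the limit in (1) is a directional derivative, $D_{\kappa_\sigma} f(k^{t-1})$. By the chain rule,

$$D_{\kappa_\sigma} f(k^{t-1}) = \langle K(\kappa_\sigma), F_c'(K(k^{t-1})) \rangle_F,$$

where, for convenience, we defined $F_c(K) = A_c(K, K^*)$. Define

$$F(K) = \frac{\langle K, K^*_c \rangle_F}{\|K\|_F \, \|K^*_c\|_F},$$

so that $F_c(K) = F(K_c)$. Some calculations give that

$$F'(K) = \frac{K^*_c - \|K\|_F^{-2} \, \langle K, K^*_c \rangle_F \, K}{\|K\|_F \, \|K^*_c\|_F}$$

(which is the function defined in (3)). We claim that the following holds: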

Lemma 3. $F_c'(K) = C_n F'(K_c) C_n$.

Proof. By the definition of derivatives, as $H \to 0$,

$$F(K + H) - F(K) = \langle F'(K), H \rangle_F + o(\|H\|).$$

Also,

$$F_c(K + H) - F_c(K) = \langle F_c'(K), H \rangle_F + o(\|H\|).$$

Now,

$$\begin{aligned}
F_c(K + H) - F_c(K) &= F(C_n K C_n + C_n H C_n) - F(C_n K C_n) \\
&= \langle F'(K_c), C_n H C_n \rangle_F + o(\|H\|) \\
&= \langle C_n F'(K_c) C_n, H \rangle_F + o(\|H\|),
\end{aligned}$$

where the last equality follows from the cyclic property of the trace. Therefore, by the uniqueness
of the derivative, $F_c'(K) = C_n F'(K_c) C_n$. □
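
The lemma can be checked numerically. The sketch below is only a sanity check under our own assumptions: it uses a small random positive semidefinite matrix, a made-up label vector, and plain Python lists. It compares a central finite difference of the centered objective along a random symmetric direction with the inner product predicted by the lemma, and it also checks that the gradient at a centered matrix is unchanged by centering.

```python
import random


def inner(A, B):
    """Frobenius inner product of two square matrices given as lists of rows."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def center(K):
    """Return C_n K C_n, where C_n = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def F(K, Y):
    """Normalized inner product between K and the (centered) target Y."""
    return inner(K, Y) / (inner(K, K) ** 0.5 * inner(Y, Y) ** 0.5)


def f_prime(K, Y):
    """Gradient of F at K, following the closed form in the proof."""
    n = len(K)
    nk2 = inner(K, K)
    scale = nk2 ** 0.5 * inner(Y, Y) ** 0.5
    coef = inner(K, Y) / nk2
    return [[(Y[i][j] - coef * K[i][j]) / scale for j in range(n)] for i in range(n)]


def axpy(A, B, t):
    """Return A + t * B."""
    return [[a + t * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


random.seed(0)
n = 5
A = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
K = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(n)] for i in range(n)]
y = [1, -1, 1, 1, -1]                                # made-up labels
Y = center([[yi * yj for yj in y] for yi in y])      # centered target kernel
B = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
H = [[B[i][j] + B[j][i] for j in range(n)] for i in range(n)]  # symmetric direction


def F_c(M):
    return F(center(M), Y)


eps = 1e-6
finite_diff = (F_c(axpy(K, H, eps)) - F_c(axpy(K, H, -eps))) / (2 * eps)
G = f_prime(center(K), Y)
lemma_value = inner(center(G), H)   # <C_n F'(K_c) C_n, H>_F
centering_gap = max(abs(a - b) for ra, rb in zip(center(G), G) for a, b in zip(ra, rb))
print(finite_diff, lemma_value, centering_gap)
```

The two printed derivative values should agree up to finite-difference error, and the last value should be at the level of floating-point rounding.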

Now, notice that $C_n F'(K_c) C_n = F'(K_c)$. Thus, we see that the value of $\sigma_t^*$ can be obtained by

$$\sigma_t^* = \arg\max_{\sigma} \, \langle K(\kappa_\sigma), F'((K(k^{t-1}))_c) \rangle_F,$$

which was the statement to be proved.

A.2 Proof of Proposition 2

Let $g(\eta) = f(k^{t-1} + \eta \kappa_{\sigma_t^*})$. Using the definition of $f$, we find that with some constant $\rho > 0$,

$$g(\eta) = \rho \, \frac{a + b\eta}{(c + 2d\eta + e\eta^2)^{1/2}}.$$

Notice that here the denominator is bounded away from zero (this follows from the form of the
denominator of $f$). In particular, $e > 0$. Further,

$$\lim_{\eta \to \infty} g(\eta) = -\lim_{\eta \to -\infty} g(\eta) = \rho \, \frac{b}{\sqrt{e}}. \tag{5}$$

Taking the derivative of $g$ we find that

$$g'(\eta) = \rho \, \frac{bc - ad + (bd - ae)\eta}{(c + 2d\eta + e\eta^2)^{3/2}}.$$

Therefore, $g'$ has at most one root and $g$ has at most one global extremum, from which the result
follows by solving for the root of $g'$ (if $g'$ does not have a root, $g$ is constant).

Figure 3: The flipped objective function underlying (2) as a function of $\sigma$, the parameter of
a Gaussian kernel, in selected MNIST and UCI Letter problems. Our algorithm needs to find the
minimum of these functions (and similar ones).
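
The closed-form step of Proposition 2 can be illustrated with a short sketch. Setting the numerator of the derivative in the proof of Proposition 2 to zero gives the stationary point $(ad - bc)/(bd - ae)$ whenever $bd \neq ae$. The coefficient values below are made up; reading $a, \dots, e$ as Frobenius inner products between the centered current kernel matrix, the centered new base kernel matrix, and the centered target is our interpretation of the definition of $f$, not something spelled out in the proof.

```python
from math import sqrt


def g(eta, a, b, c, d, e, rho=1.0):
    """Alignment along the search direction, in the form used in the proof."""
    return rho * (a + b * eta) / sqrt(c + 2 * d * eta + e * eta ** 2)


def stationary_step(a, b, c, d, e):
    """Root of the numerator bc - ad + (bd - ae) * eta, or None if there is none."""
    slope = b * d - a * e
    if slope == 0:
        return None  # the numerator is constant, so there is no isolated root
    return (a * d - b * c) / slope


# Hypothetical coefficients satisfying c * e > d ** 2, so the denominator is positive.
a, b, c, d, e = 1.0, 2.0, 4.0, 1.0, 3.0
eta_star = stationary_step(a, b, c, d, e)
print(eta_star, g(eta_star, a, b, c, d, e))

# With bd - ae < 0 the derivative changes sign from + to -, so the root is a maximum.
grid = [-50 + 0.25 * k for k in range(401)]
is_max_on_grid = all(g(eta_star, a, b, c, d, e) >= g(x, a, b, c, d, e) for x in grid)
print(is_max_on_grid)
```

When $bd - ae > 0$ the same root is a minimum instead, and the supremum of $g$ is only approached in the limit given by (5); the sketch does not handle that case.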



B Details of the numerical experiments 

In this section we provide further details and data for the numerical results. 

B.1 Non-Convexity Issue

As we mentioned in Section 2, our algorithm may need to solve a non-convex optimization problem
in each iteration to find the best kernel parameter. Here, we explore this problem numerically, by
plotting the function to be optimized in the case of a Gaussian kernel with a single bandwidth
parameter. In particular, we plotted the objective function of Equation (2) with its sign flipped,
so we are interested in the local minima of the function
$h(\sigma) = -\langle K(\kappa_\sigma), F'((K(k^{t-1}))_c) \rangle_F$; see Figure 3. The function
$h$ is shown for some iterations of some of the tasks from both the MNIST and the UCI Letter
experiments. The number in parentheses in the caption specifies the corresponding iteration of the
algorithm. On these plots, the objective function has at most 2 local minima. Although in some cases
the functions have some steep parts (at the scales shown), their optimization does not seem very
difficult.
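
The kind of scan described in this section can be reproduced on a toy problem. The sketch below is not the experiment from the paper: the data are a made-up one-dimensional problem, the current kernel is taken to be a single Gaussian kernel, and the parameterization $\kappa_\sigma(u, v) = \exp(-\sigma (u - v)^2)$ is our assumption. It evaluates the flipped objective on a logarithmic grid and counts the interior local minima on that grid.

```python
from math import exp


def inner(A, B):
    """Frobenius inner product of two square matrices given as lists of rows."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def center(K):
    """Return C_n K C_n, where C_n = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def f_prime(K, Y):
    """Gradient of the normalized inner product between K and the centered target Y."""
    n = len(K)
    nk2 = inner(K, K)
    scale = nk2 ** 0.5 * inner(Y, Y) ** 0.5
    coef = inner(K, Y) / nk2
    return [[(Y[i][j] - coef * K[i][j]) / scale for j in range(n)] for i in range(n)]


def gaussian_gram(x, sigma):
    # Assumed parameterization: kappa_sigma(u, v) = exp(-sigma * (u - v)^2).
    return [[exp(-sigma * (u - v) ** 2) for v in x] for u in x]


# Toy problem (made up): evenly spaced inputs, label = sign of the input.
n = 12
x = [-2 + 4 * i / (n - 1) for i in range(n)]
y = [1 if xi > 0 else -1 for xi in x]
Y = center([[yi * yj for yj in y] for yi in y])

sigma_prev = 1.0  # hypothetical parameter of the kernel selected so far
G = f_prime(center(gaussian_gram(x, sigma_prev)), Y)


def h(sigma):
    """Flipped objective: minus the inner product of the new Gram matrix with G."""
    return -inner(gaussian_gram(x, sigma), G)


grid = [10 ** (-3 + 0.1 * k) for k in range(61)]  # 1e-3, ..., 1e3
values = [h(s) for s in grid]
minima = [grid[k] for k in range(1, len(grid) - 1)
          if values[k] < values[k - 1] and values[k] < values[k + 1]]
print(len(minima), "interior local minima on the grid")
```

As a sanity check, `h(sigma_prev)` should vanish up to rounding: the objective is invariant to rescaling the current kernel, so moving along the current kernel itself does not change the alignment to first order.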

B.2 Details of the 50-dimensional synthetic dataset experiment

The 1-dimensional version of our algorithm, CA-1D, and the CR method employ Matlab's fmincon
function with multiple restarts, started from a set of powers of 10, to choose the kernel
parameters. The multi-dimensional version of our algorithm, CA-nD, uses fmincon only once, since in
this particular example the search runs on a 50-dimensional search space, which makes the search an
expensive operation. The starting point of the CA-nD method is a vector of equal elements, where
each element is the weighted average of the kernel parameters found by the CA-1D method, weighted
by the coefficients of the corresponding kernels.
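
The search strategy described above can be sketched in a few lines. This is not the code used in the experiments: a golden-section search over the exponent of the parameter stands in for Matlab's fmincon, the restart exponents and the toy objective are hypothetical, and the helper names are ours. The second function shows the weighted-average starting point used for the multi-dimensional search.

```python
from math import log10, sqrt


def golden_section(fun, lo, hi, tol=1e-6):
    """Local minimizer of fun on [lo, hi]; exact only if fun is unimodal there."""
    inv_phi = (sqrt(5) - 1) / 2
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = fun(x1), fun(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The minimum lies in [lo, x2]; reuse x1 as the new right probe.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = fun(x1)
        else:
            # The minimum lies in [x1, hi]; reuse x2 as the new left probe.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = fun(x2)
    return (lo + hi) / 2


def multi_restart(objective, start_exponents, half_width=1.0):
    """Run a local search around 10**s for each start exponent s; keep the best."""
    best_sigma, best_value = None, float("inf")
    for s in start_exponents:
        u = golden_section(lambda u: objective(10 ** u), s - half_width, s + half_width)
        value = objective(10 ** u)
        if value < best_value:
            best_sigma, best_value = 10 ** u, value
    return best_sigma, best_value


def weighted_average_start(params, weights, dim):
    """Vector of equal elements: the weight-averaged 1-D kernel parameter."""
    avg = sum(w * p for w, p in zip(weights, params)) / sum(weights)
    return [avg] * dim


def toy_objective(sigma):
    """Made-up objective with two local minima in log10(sigma)."""
    u = log10(sigma)
    return u ** 4 - 2 * u ** 2 + 0.3 * u


best_sigma, best_value = multi_restart(toy_objective, [-2, -1, 0, 1, 2])
print(best_sigma, best_value)
print(weighted_average_start([1.0, 3.0], [1.0, 1.0], dim=3))
```

Restarting from several scales matters here because a single local search started near the wrong basin would return the shallower of the two minima.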

The soft margin SVM regularization parameter is tuned over a grid of powers of 10 using an
independent validation set with 1000 instances. We also tuned the value of the regularization
parameter $\lambda$ in Equation (4) from $10^{\{-5, \dots, 14\}}$ using the same validation set
(the best value of $\lambda$ is the one that achieves the highest value of alignment on the
validation set). We decided to use a large validation set, following essentially the practice of
Kloft et al. (2011, Section 6.1), to make sure that reasonably good regularization parameters are
used in the experiments, i.e., to factor out the choice of the regularization parameters. This
might bias our results towards CA-nD, as compared to CA-1D, though similar results were achieved
with a smaller validation set of size 200. As a final detail, note that D1, D2 and CR also use the
validation set for choosing the value of their regularization factor and, together with the
regularizer, the weights as well. Hence, their results might also be positively biased (though we
do not think this is significant in this case).

Figure 4: Alignment values in the 50-dimensional synthetic dataset experiment.

The running times shown in Figure 2 include everything from the beginning to the end, i.e., from
learning the kernels to training the final classifiers (the extra cross-validation step is what
makes CA-nD expensive).

Figure 4 shows the (centered) alignment values for the learned kernels (on the test data) as a
function of the relevance parameter $\gamma$. It can be readily seen that, in terms of kernel
alignment, the multi-dimensional method has a real edge over the other methods when the number of
irrelevant features is large. As seen in Figure 4, this edge also translates into an edge in terms
of the test-set performance. Note also that the discretization is fine enough that the
alignment-maximizing finite kernel learning method DA can achieve the same alignment as the method
CA-1D.
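
For reference, the centered alignment reported in these plots can be computed as in the following minimal sketch, which assumes the target kernel is the outer product of the label vector with itself. The two Gram matrices are toy examples; to obtain a test-set alignment, the Gram matrix of the learned kernel on the test inputs and the test labels would be passed in instead.

```python
def inner(A, B):
    """Frobenius inner product of two square matrices given as lists of rows."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def center(K):
    """Return C_n K C_n, where C_n = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def centered_alignment(K, y):
    """Centered alignment between the Gram matrix K and the label kernel y y^T."""
    Kc = center(K)
    Yc = center([[yi * yj for yj in y] for yi in y])
    return inner(Kc, Yc) / (inner(Kc, Kc) ** 0.5 * inner(Yc, Yc) ** 0.5)


y = [1, 1, -1, -1]                                   # made-up labels
ideal = [[yi * yj for yj in y] for yi in y]          # the label kernel itself
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(centered_alignment(ideal, y))      # perfectly aligned kernel
print(centered_alignment(identity, y))   # uninformative kernel
```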

B.3 Detailed results for the real datasets 



Figure 5: Misclassification percentages in different tasks of the MNIST dataset.

Figure 6: Misclassification percentages in different tasks of the UCI Letter recognition dataset.

Figure 7: Alignment values in different tasks of the UCI Letter recognition dataset.

Table 3: Datasets used in the experiments

Dataset             # features  # instances  Training size  Validation size  Test size
Banana                       2         5300            500             1000       2000
Breast Cancer                9          263             52               78        133
Diabetes                     8          768            153              230        385
German                      20         1000            200              300        500
Heart                       13          270             54               81        135
Image Segmentation          18         2086            400              600       1000
Ringnorm                    20         7400            500             1000       2000
Sonar                       60          208             41               62        105
Splice                      60         2991            500             1000       1491
Thyroid                      5          215             43               64        108
Waveform                    21         5000            500             1000       2000



Figure 8: Misclassification percentages obtained in 11 datasets.